Old Dominion University – Micro-blog Mapper and Epidemic Investigator

VAST 2011 Challenge
Mini-Challenge 1 - Characterization of an Epidemic Spread

Authors and Affiliations:

Kalpesh Padia, Old Dominion University, kpadia@cs.odu.edu [PRIMARY contact]

Dr. Michele C. Weigle, Old Dominion University [Faculty Advisor], mweigle@cs.odu.edu

Tool(s):

All tools used to solve this challenge were developed in-house using Perl, Microsoft C# 3.0, Ruby on Rails, JavaScript and jQuery, on the Microsoft .NET Framework 4.0 and Rails platforms. A MySQL database stored the micro-blogs and the calculated metadata. The tools were developed over a period of one week in May 2011 at Old Dominion University in Norfolk, Virginia. The primary visualization was developed as a .NET desktop application, and a Rails application was created as a secondary visualization. The Rails application was hosted on an Apache server with the Phusion Passenger module, running on Ubuntu.

Video:

 Video can be found here.

 

ANSWERS:


MC 1.1 Origin and Epidemic Spread: Identify approximately where the outbreak started on the map (ground zero location). If possible, outline the affected area. Explain how you arrived at your conclusion.

We conclude that the outbreak started in Uptown and picked up in Downtown (Figure 1), spreading first to Eastside and later infecting a large number of people in other parts of the city (Figure 2).

First, the large dataset was filtered to prune noise: a Perl script generated a new dataset containing only records that hint at illness, focusing on keywords related to the symptoms mentioned in the task. Only references to the bloggers’ own illness were considered, and any use of the term “pain” in the sense of emotional pain was removed manually. The filtered dataset was then input to the .NET application for visualization. Although a few people complained of flu as early as April 30th, the illness does not become an epidemic until May 18th (Figure 1), when it quickly starts to spread to different parts of the city.
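
The keyword filtering described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl script that was actually used: the keyword list, the `text` field name and the first-person check are all assumptions made for the example.

```python
import re

# Hypothetical symptom keywords; the actual list used in the Perl script is not given.
SYMPTOM_KEYWORDS = ["flu", "fever", "cough", "coughing", "chills",
                    "pain", "headache", "nausea", "diarrhea"]

# Whole-word, case-insensitive match on any symptom keyword.
SYMPTOM_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, SYMPTOM_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)

# Crude assumption: a first-person pronoun suggests the blogger's own illness.
FIRST_PERSON_RE = re.compile(r"\b(i|my|me)\b", re.IGNORECASE)


def hints_own_illness(text):
    """Return True if the message mentions a symptom in a first-person context."""
    return bool(SYMPTOM_RE.search(text)) and bool(FIRST_PERSON_RE.search(text))


def filter_records(records):
    """Keep only records (dicts with a hypothetical 'text' field) that hint at illness."""
    return [r for r in records if hints_own_illness(r["text"])]
```

A heuristic like this still lets through messages such as "my sister has the flu" or figurative uses of “pain”, so a manual pass of the kind described above would remain necessary.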

Figure 1. Epidemic spread on the morning of May 18, 2011.

Figure 2. Epidemic spread on May 19, 2011.


MC 1.2 Epidemic Spread: Present a hypothesis on how the infection is being transmitted. For example, is the method of transmission person-to-person, airborne, waterborne, or something else? Identify the trends that support your hypothesis. Is the outbreak contained? Is it necessary for emergency management personnel to deploy treatment resources outside the affected area? Explain your reasoning.

We believe that this illness is spreading primarily from person to person through coughing and sneezing, similar to ordinary flu. As a result, the outbreak is not confined to a specific region, and people in different parts of the city (and potentially outside the city, if infected individuals choose to travel) risk being infected. It is therefore necessary for emergency management personnel to deploy treatment resources outside the affected area.

To analyze the spread of the epidemic, we first filtered the dataset and then fed it to our .NET application to visualize the locations of infected persons on the city map as time progresses. This gave us a basic idea of how heavily infected the various parts of the city were and how fast the epidemic was spreading, and it allowed us to check whether the weather played any role in the spread.

In the beginning, the infection appears to be spreading evenly across the city (Figure 3), but later we observe that more people are infected first in Eastside, then in Villa, Westside, Smogtown and Plainsville, followed by more infections in other parts of the city. The epidemic appears to peak on May 18th, when a large number of people get infected in Downtown and Eastside (Figure 4) while the wind is blowing towards the west at high speed. Correlating these observations with the weather data, we conclude that weather does not play a role in the spread of infection. Furthermore, hardly anyone fell ill around the lakes and the reservoir, suggesting that the epidemic was not waterborne either.

Figure 3. Epidemic spread around May 8, 2011.

Figure 4. Epidemic spread in the afternoon of May 18th, 2011.

We then correlated this observation with the ratio of daytime population to population density in various parts of the city. Uptown and Downtown have high ratios of 3.9 and 2.9 respectively, the ratios of Cornertown, Villa and Smogtown are closer to 1, and the ratio for the other parts of the city is 0.8 or less. On May 18th, the number of infected people in Downtown and Eastside grows steadily during the daytime until 6 PM, while almost no one falls ill in the western part of the city. The high daytime density ratio of these regions is probably responsible for this. After 6 PM, infection starts to pick up in the parts of the city with smaller ratios as well, which further suggests that people infect others as they return to their homes. Over the next two days, as the large number of infected people are fatigued and stay at home, they infect more people in Villa, Smogtown and Plainsville (Figure 5). The high density ratio of Westside makes it the new hub of infection alongside Downtown, where people continue to get infected. It can also be observed that by midday on May 18th people in Downtown and Eastside have already started to seek medical attention at the hospitals, while people in other parts of the city do the same over the next two days. From these observations we conclude that the illness is transmitted from person to person.
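
As a small illustration of how the daytime ratios above separate the districts, the grouping can be written down as a simple classifier. The sketch is in Python; the two thresholds and the group names are assumptions chosen for the example, and only the Uptown and Downtown ratios come from the text.

```python
def classify_district(ratio, hub_threshold=2.0, residential_threshold=0.8):
    """Group a district by its daytime ratio.

    Both thresholds are illustrative assumptions, not values used in the analysis.
    """
    if ratio >= hub_threshold:
        return "daytime hub"      # many more people present during the day
    if ratio <= residential_threshold:
        return "residential"      # fewer people present during the day
    return "balanced"             # ratio close to 1


# Ratios quoted in the text; other districts are omitted because
# their exact values are not stated.
ratios = {"Uptown": 3.9, "Downtown": 2.9}
groups = {name: classify_district(r) for name, r in ratios.items()}
```

Under these assumed thresholds both Uptown and Downtown fall into the "daytime hub" group, which matches the daytime growth of infections described for Downtown.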

Figure 5. Epidemic spread on May 20th, 2011. Note that people have already started to seek medical attention.

Further, we imported our filtered dataset into a MySQL database and calculated term frequency (TF) and term frequency-inverse document frequency (TF-IDF) for the terms appearing in the dataset. For these calculations, we treated all blogs generated in a day as a single document, which gave a corpus of 21 documents, one for each day. We created a web interface using Ruby on Rails, jQuery and JavaScript to visualize the top terms in the dataset as a time cloud, in the hope of observing trends in the illness over time (Figure 6).
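
The per-day TF and TF-IDF calculation described above can be sketched as follows. This is a Python illustration under stated assumptions: it uses the common textbook weighting (relative term frequency, and idf = log(N / df)), which may differ from the exact variant we computed in MySQL, and the function names and input layout are hypothetical.

```python
import math
from collections import Counter


def daily_tf_idf(tokens_by_day):
    """Compute TF and TF-IDF per term, treating each day as one document.

    tokens_by_day: dict mapping a day label to the list of tokens from
    all messages posted that day (hypothetical input layout).
    """
    n_docs = len(tokens_by_day)
    counts = {day: Counter(tokens) for day, tokens in tokens_by_day.items()}

    # Document frequency: in how many daily documents each term occurs.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())

    scores = {}
    for day, c in counts.items():
        total = sum(c.values())
        scores[day] = {
            term: {
                "tf": n / total,
                "tfidf": (n / total) * math.log(n_docs / df[term]),
            }
            for term, n in c.items()
        }
    return scores


def top_terms(scores, day, key="tf", k=10):
    """Return the k highest-scoring terms for a day, e.g. to size a time cloud."""
    ranked = sorted(scores[day].items(), key=lambda kv: kv[1][key], reverse=True)
    return [term for term, _ in ranked[:k]]
```

With this weighting, a term that occurs in every daily document receives a TF-IDF of zero, so terms that are reported every day drop out of a TF-IDF ranking while rarer terms rise.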

Figure 6. TF and TF-IDF time cloud on filtered dataset for May 18th, 2011.

While TF was helpful to some extent in observing the severity of the symptoms over time, TF-IDF was not, as it pushed less common terms to the top. We noticed that while people complain about body pain, fever and flu on an almost daily basis, coughing becomes prominent as a symptom every other day. The symptoms also get worse on May 18th, when the frequencies of these terms become much higher and the terms “worse”, “night” and “wish” appear, hinting at worsening symptoms. Over the next two days the patients also develop diarrhea. We found this interface useful for observing trends in symptoms, and it further supported our hypothesis that the illness was being spread from person to person through coughing.

We also imported the entire dataset into the database and created another web interface to visualize the movement of an infected individual across the city over time after becoming infected (Figure 7). We can either select a user id to plot that user’s movement across the city after infection or select a date to see who was infected on a particular day. An individual can also be located on the map on a specific date after infection. We observed that most people kept moving across the city once they were infected, thus infecting more people.
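
The two queries behind this interface (a user's track after infection, and the users first infected on a given date) can be sketched as below. This is a Python illustration rather than the Rails and MySQL code: the record fields, the sample rows, the user ids and the `is_ill` predicate are all hypothetical.

```python
from datetime import datetime, date


def first_illness_times(records, is_ill):
    """Map each user id to the time of their first message that hints at illness."""
    first = {}
    for r in sorted(records, key=lambda r: r["time"]):
        if r["user_id"] not in first and is_ill(r["text"]):
            first[r["user_id"]] = r["time"]
    return first


def track_after_infection(records, user_id, is_ill):
    """Return (time, lat, lon) points for a user from their first illness message on."""
    start = first_illness_times(records, is_ill).get(user_id)
    if start is None:
        return []
    own = sorted((r for r in records if r["user_id"] == user_id),
                 key=lambda r: r["time"])
    return [(r["time"], r["lat"], r["lon"]) for r in own if r["time"] >= start]


def infected_on(records, day, is_ill):
    """Return the set of user ids whose first illness message falls on the given date."""
    return {u for u, t in first_illness_times(records, is_ill).items()
            if t.date() == day}


# Made-up sample rows, for illustration only.
sample = [
    {"user_id": 1, "time": datetime(2011, 5, 4, 9, 0), "lat": 42.20, "lon": 93.40, "text": "coffee time"},
    {"user_id": 1, "time": datetime(2011, 5, 5, 8, 0), "lat": 42.21, "lon": 93.41, "text": "woke up with a fever"},
    {"user_id": 1, "time": datetime(2011, 5, 6, 12, 0), "lat": 42.25, "lon": 93.35, "text": "at the mall"},
    {"user_id": 2, "time": datetime(2011, 5, 6, 10, 0), "lat": 42.30, "lon": 93.50, "text": "fever and chills"},
]
has_fever = lambda text: "fever" in text.lower()  # stand-in illness predicate
track = track_after_infection(sample, 1, has_fever)
```

Here `track` holds the two positions recorded for user 1 from May 5th onward, which is the kind of point sequence a map plot like Figure 7 needs.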

Figure 7. Plot showing how user 89943 moves across the city after getting infected on May 5th, 2011.

Notice that the user visits various parts of the city over the next 15 days, possibly transmitting the infection.

All of the above observations support our hypothesis that the epidemic is spreading from person to person and that it is not confined to a specific region.